Abstract

Twitter updates now represent an enormous stream of
information originating from a wide variety of formal
and informal sources, much of which is relevant to
real-world events. In this paper we adapt existing
bio-surveillance algorithms to detect localised spikes in
Twitter activity corresponding to real events with a high
level of confidence. We then develop a methodology to
automatically summarise these events, both by providing
the tweets which fully describe the event and by
linking to highly relevant news articles. We apply our
methods to outbreaks of illness and events strongly
affecting sentiment. In both case studies we are able to
detect events verifiable by third party sources and
produce high quality summaries.


1 Introduction 


Updates posted on social media platforms such as Twitter
contain a great deal of information about events in the
physical world, with the majority of topics discussed on Twitter
being news related (Kwak et al. 2010). Twitter can therefore
be used as an information source in order to detect real world
events. The content and metadata contained in the tweets can
then be leveraged to describe the events and provide context
and situational awareness. Applications of event detection
and summarisation on Twitter have included the detection
of disease outbreaks (Aramaki, Maskawa, and Morita 2011),
natural disasters such as earthquakes (Sakaki, Okazaki, and
Matsuo 2010) and reaction to sporting events (Zubiaga et al.
2012).


Using the Twitter stream for event detection yields a
variety of advantages. Normally in order to automatically
detect real-world events a variety of official and media sources
would have to be tracked. These are usually published with
some lag time, and any system monitoring them
programmatically would require customisation for each source since
they are not formatted in any standard way. Twitter
provides a real-time stream of information that can be
accessed via a single API. In addition a rich variety of sources
publish information to Twitter, since it is a forum both
for the traditional media and for a newer brand of citizen
journalists (Hermida 2010). Tweets also contain metadata
that can be mined for information, including location data,
user-supplied hashtags and user profile information such as
follower-friend relationships. The primary drawback of
using Twitter is that it is an unstructured source that contains
a great deal of noise along with its signal. Tweets can be
inaccurate as a result of rumour, gossip or active manipulation
via spamming.

In this paper we apply existing bio-surveillance
algorithms to detect candidate events from the Twitter stream,
employing customised filtering techniques to remove
spurious events. We then extract the terms from the event tweets
which best characterise the event and are most efficacious
in retrieving related news. These terms are used to filter and
rank the most informative tweets for presentation to the user
along with the most relevant news articles.

Our techniques are evaluated using two case studies, both
using a dataset of geo-located tweets from England and
Wales collected in 2014. The primary case study is the
detection of illness outbreak events. We then generalise our
techniques to events strongly affecting Twitter sentiment, such
as celebrity deaths and big sports matches.

In Section 2 we discuss related work in the area of event
detection and situational awareness using Twitter. Sections 3
and 4 outline our methodology and results. We then discuss
our conclusions in Section 5.


2 Related Work 

Much of the work on event detection using social media has
focused on using topic detection methods to identify
breaking news stories. Streaming document similarity measures
(Petrovic, Osborne, and Lavrenko 2010), (Osborne et al.
2014) and online incremental clustering (Becker, Naaman,
and Gravano 2011) have been shown to be effective for this
purpose.

Other approaches have aimed to pick up more localised
events. These have included searching for spatial clusters in
tweets (Walther and Kaisser 2013), leveraging the social
network structure (Aggarwal and Subbian 2012), analysing the
patterns of communication activity (Chierichetti et al. 2014)
and identifying significant keywords by their spatial
signature (Abdelhaq, Sengstock, and Gertz 2013).


In the field of disease outbreak detection efforts have
mostly focused on tracking levels of influenza by
comparing them to the level of self-reported influenza on Twitter,
in studies such as (Broniatowski, Paul, and Dredze 2013)
and (Li and Cardie 2013). Existing disease outbreak
detection algorithms have also been applied to Twitter data,
for example in a case study (Diaz-Aviles et al. 2012) of
a non-seasonal disease outbreak of Enterohemorrhagic
Escherichia coli (EHEC) in Germany. They searched for tweets
from Germany matching the keyword “EHEC”, and used
the daily tweet counts as input to their epidemic detection
algorithms. Using this methodology an alert for the EHEC
outbreak was triggered before standard alerting procedures
would have detected it. Our study uses a modified and
generalised version of this event detection approach.

Diaz-Aviles et al. also attempted to summarise outbreak
events by selecting the most relevant tweets, using a
customised ranking algorithm. Other studies which have
summarised events on Twitter by selecting the most relevant
tweets include (Zubiaga et al. 2012) and (Long et al. 2011).

There has been less related work on linking or
substantiating events detected from Twitter with traditional news
media. One study (Abel et al. 2011) analysed various methods
of contextualising Twitter activities by linking them to news
articles. The methods they examined included finding tweets
with explicit URL links to news articles, using the content of
tweets, hashtags and entity recognition. The best non-URL
based strategy that they found was the comparison of named
entities extracted from news articles using OpenCalais with
the content of the tweets.


3 Methodology 

3.1 Problem Definition 

Our definition of a real-world event within the context of
Twitter is taken from (Becker, Naaman, and Gravano 2011),
with the exception that we have added a concept of event
location.

Definition 1. (Event) An event is a real-world occurrence $e$
with (1) an associated time period $T_e$ and (2) a time-ordered
stream of Twitter messages $M_e$, of substantial volume,
discussing the occurrence and published during time $T_e$. The
event has a location $L_e$ where it took place, which may be
specific or cover a large area, and the messages have a set of
locations $L_{M_1}, \ldots, L_{M_n}$ which they were sent from.

When given a time-ordered stream of Twitter messages
$M$, the event detection problem is therefore one of
identifying the events $e_1, \ldots, e_n$ that are present in this stream and
their associated time periods and messages $M_e$. It is also
valuable to identify the primary location or locations $L_{M_i}$
that messages have originated from, and if possible the event
location $L_e$. The situational awareness problem is one of
taking the time period $T_e$ and messages $M_e$ and producing an
understandable summary of the event and its context.

3.2 Overview 

Our approach to the event detection problem incorporates
location by detecting deviations from baseline levels of tweet
activity in specific geographical areas. This allows us to
track the location of messages relating to events, and in some
cases determine the event location itself. We break down the
problem by defining classes of events which we are
interested in and formulating a set of groups of keywords which
describe each class. In this paper we have examined two
distinct classes:

• Outbreaks of symptoms of illness, such as coughing or itching

• Events triggering emotional states, such as happiness or sadness

We track the number of tweets mentioning each keyword
in each of our areas and use modified bio-surveillance
algorithms to detect spikes in activity which we can classify as
events.

Initially we designed the system with health symptom
event detection as the primary use case. This led to a
system design focused around keywords and aliases for those
keywords, since a limited range of illness symptoms
characterises most common diseases and the vocabulary used to
describe these symptoms is also relatively limited. After
several iterations of this approach we noted that it could be
viable as a general event detection and situational awareness
method, so we added another event class, emotion-based
events, to test the feasibility of the general approach.

Our situational awareness approach is based on
identifying terms from the event tweets which characterise the
events and using them to retrieve relevant news articles and
identify the most informative tweets. The news search uses
metrics based on cosine similarity to ensure that searches
return related groups of articles.

3.3 Architecture 

The general approach can be described by the architecture
in Figure 1. Every new event class requires a list of keyword
groups. Optionally a domain specific data pre-processing
step can also be included. For example in the health
symptom case we employ a machine learning classifier to remove
noise (those tweets not actually concerning health). These
are the only two aspects of the design that need to be altered
to provide event detection and situational awareness to a new
problem domain.

3.4 Event Classes 

We now go into a more detailed explanation of our event 
classes and how we formulated the keyword groups. Each 
keyword group consists of a primary keyword which is used 
to identify the group, e.g. vomit, and a number of aliases that 
expand the group, e.g. throwing up, being sick, etc. 

Illness Symptoms To build up a list of symptoms
and related keywords we searched Freebase for
/medicine/symptom. Each of these symptoms is
defined as a primary keyword. They are returned with a list
of aliases that are used as related keywords.

The next step in creating a symptom list was to filter
these symptoms by their frequency in the Twitter data, since
only those words actually used on Twitter are of interest.
All symptoms with fewer than 10 mentions in the Twitter data
were removed from this candidate list. This excluded a large
proportion of symptoms, reducing the set from 2000 to 200.



















We further limited the set by removing symptoms not
related to infectious diseases. We also added primary
keywords and aliases for some common conditions such as
hayfever and flu. This step resulted in 46 symptom groups.

Emotion States For a list of emotion states and
associated keywords we used the work of Shaver et al. They
conducted research (Shaver et al. 1987) to determine which sets
of words were linked to emotions and how these cluster
together. We took the six basic emotions identified in the work
as primary keywords: love, joy, surprise, sadness, anger and
fear. Shaver’s work associated each of these with a list of
terms to form a tree. We took the terms from lower leaves
on the tree for each emotion as our alias sets (see Table 1
for examples). The only alteration we made was that after
some initial analysis we discovered that the term “happy”
from the “joy” category was a very strong signal of special
events such as Valentine’s Day, Mother’s Day and Easter. It
was also very often used on a daily basis due to people
offering birthday greetings. We therefore separated “happy” into
its own category separate from “joy”.

In addition we employed SentiStrength (Thelwall et al.
2010), a sentiment analysis tool, to classify our tweets into
positive and negative emotional sentiment. We took those
classified as being very positive and very negative as
additional emotion states.

Keyword    Aliases
--------   --------------------------------
surprise   amazed, astonished, surprised...
sadness    depressed, unhappy, crying...
joy        glad, delighted, pleased...

Table 1: Selected emotion keyword groups and some of
their aliases: keyword groups contain a primary keyword
and aliases (taken from Shaver et al.).

Figure 1: Event Detection and Situational Awareness
architecture: To apply to a new example a user needs to
provide a keyword group list and optionally a noise filter to
remove tweets that do not strictly match the criteria of interest.

3.5 Data Collection

Using Twitter’s live streaming API we collected geo-tagged
tweets between 11th February 2014 and 11th October 2014.
Tweets were collected from within a geographical bounding
box containing England and Wales. Retweets were excluded
due to our focus on tweets as primary reports or reactions to
events. This resulted in a data-set of 95,852,214 tweets from
1,230,015 users. 1.6% of users geo-tag their tweets
(Leetaru et al. 2013), so our data is a limited sample of the total
tweet volume from England and Wales during this period.
We chose to use only geo-tagged tweets since they contain
metadata giving an accurate location for the user. This
allows us to locate each tweet within our geographical model.


3.6 Location Assignment 

Our methodology relies on the collection of baseline levels
of tweet activity in an area, so that alarms can be triggered
when this activity increases. We therefore amalgamated the
fine-grained location information from the geo-coded tweets
by assigning them to broader geographical areas. We used
a data driven approach to generate the geographical areas
rather than using administrative areas such as towns or
counties. This technique allowed us to select only those areas
with a minimum level of tweet activity, and also did not
require any additional map data. It would therefore be
reusable for any region or country with a sufficient level of
Twitter usage.

We began by viewing a sample of the collected tweets
as geo-spatial points. Viewed on a map these clearly
clustered in the densely populated areas of England and Wales.
We therefore decided to use a clustering algorithm on these
points in order to separate out areas for study. We employed
the Density-Based Spatial Clustering of Applications with
Noise (DBSCAN) algorithm (Ester et al. 1996) for clustering, as
this does not require a priori knowledge of the number of
clusters in the data. The features provided to DBSCAN were
the latitudes and longitudes of the tweets.

The clusters produced by the algorithm matched the most
populated areas, corresponding to the largest cities or towns
in the UK as shown in Figure 2. They also separated most
cities into distinct clusters (a notable exception being the
conglomeration of Liverpool and Manchester). In total 39
clusters were created for England and Wales and each was
given an ID and a label. We then created a convex hull
around each cluster, providing a polygon that can be used
to check whether a point is in the cluster or outside it. Points
outside all of the clusters were assigned to a special “noise”
cluster, and not included in the analysis. Overall 80% of
tweets were assigned to specific clusters and the remainder
to noise, giving us good coverage of geo-tagged tweets using
our cluster areas.
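The hull construction and containment check described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes each cluster is a list of (longitude, latitude) pairs treated as planar coordinates, uses Andrew's monotone chain for the hull, and the names `convex_hull`, `in_hull` and `assign_cluster` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude), treated as planar (assumption)


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def in_hull(hull: List[Point], p: Point) -> bool:
    """True if p lies inside or on the boundary of a counter-clockwise convex hull."""
    if len(hull) < 3:
        return False
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def assign_cluster(hulls: Dict[str, List[Point]], p: Point) -> str:
    """Return the ID of the first cluster whose hull contains p, else 'noise'."""
    for cluster_id, hull in hulls.items():
        if in_hull(hull, p):
            return cluster_id
    return "noise"


# Toy example: one square cluster with an interior point.
square = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
hulls = {"toy-cluster": square}
```

With these definitions, `assign_cluster(hulls, (1, 1))` falls inside the toy cluster, while a point such as `(3, 1)` is assigned to the noise cluster.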


Figure 2: UK population density (left) compared to a sample of geo-located tweets (centre) and the clusters found (right). Note
that only clusters located in England and Wales were used in this study.

3.7 Tweet Processing

As tweets are received by our system they are processed and
assigned to the symptom and emotion state classes via
keyword matching. They are assigned a location by checking
whether they fall into one of our cluster areas.
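A minimal sketch of the keyword matching step is shown below. The paper states only that tweets are assigned to classes via keyword matching; the case-insensitive whole-word regular expressions, the two sample groups (built from aliases mentioned in the text) and the names `compile_groups` and `match_groups` are illustrative assumptions.

```python
import re
from typing import Dict, List, Set

# Hypothetical keyword groups: the primary keyword names the group and the
# list holds the primary keyword plus its aliases.
KEYWORD_GROUPS: Dict[str, List[str]] = {
    "vomit": ["vomit", "throwing up", "being sick"],
    "sadness": ["sadness", "depressed", "unhappy", "crying"],
}


def compile_groups(groups: Dict[str, List[str]]) -> Dict[str, re.Pattern]:
    """Build one case-insensitive whole-word pattern per keyword group."""
    return {
        primary: re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for primary, terms in groups.items()
    }


def match_groups(text: str, patterns: Dict[str, re.Pattern]) -> Set[str]:
    """Return the primary keywords of every group mentioned in the tweet text."""
    return {primary for primary, pattern in patterns.items() if pattern.search(text)}


PATTERNS = compile_groups(KEYWORD_GROUPS)
```

A tweet can match several groups at once, in which case it would be counted towards each of them.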

For the illness symptoms we introduce a noise removal
stage at this point. It is particularly relevant for this class
of events because there are far fewer tweets relating to
illness than showing emotion states. This means that the
signal is more easily drowned out by random noise. To remove
noise we construct a machine learning classifier with the
aim of removing tweets containing alternative word usages
or general illness discussion rather than reporting of illness
events. The classifier we use is a linear SVM trained on a
semi-supervised cascading training set (Sadilek, Kautz, and
Silenzio 2012). This classifier uses the LibSVM (Chang and
Lin 2011) library, and achieves a classification accuracy of
96.1% on a test set of manually classified tweets.

The number of tweets assigned to each class in each area
is then saved on a daily basis. These counts are first
normalised to take account of Twitter’s daily effect pattern,
which shows more tweeting on weekends than weekdays.
Event detection is run daily since we are attempting to pick
up temporally coarse-grained events. Disease outbreaks take
weeks to develop, and events that shift public sentiment or
emotion will generally take hours or days to unfold.
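The text does not specify how the day-of-week normalisation is carried out. One simple possibility, shown here purely as an assumption, is to divide each daily count by the ratio of that weekday's mean count to the overall mean count. The function name `normalise_by_weekday` and the toy data are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean
from typing import Dict, List


def normalise_by_weekday(counts: Dict[date, float]) -> Dict[date, float]:
    """Rescale daily counts so that every weekday has the same average level.

    Assumed scheme (not taken from the paper): each count is divided by
    mean(count on that weekday) / mean(all counts).
    """
    by_weekday: Dict[int, List[float]] = defaultdict(list)
    for day, n in counts.items():
        by_weekday[day.weekday()].append(n)
    overall = mean(counts.values())
    factors = {
        weekday: (mean(values) / overall if overall else 1.0)
        for weekday, values in by_weekday.items()
    }
    return {
        day: (n / factors[day.weekday()] if factors[day.weekday()] else n)
        for day, n in counts.items()
    }


# Toy example: two weeks in which weekend days carry twice the weekday volume.
start = date(2014, 2, 10)
raw = {start + timedelta(days=i): 10.0 for i in range(14)}
for day in raw:
    if day.weekday() >= 5:  # Saturday or Sunday
        raw[day] = 20.0
flat = normalise_by_weekday(raw)
```

In the toy example the weekend uplift is removed entirely, so every normalised value equals the overall daily mean.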


3.8 Detecting Events 

Our event detection methodology leverages considerable
existing syndromic surveillance research by using an algorithm
designed and developed by the Centers for Disease Control
and Prevention (CDC), the Early Aberration Reporting
System (EARS) (Hutwagner et al. 2003).

Definition 2. (Alarm) An alarm is an alert produced by the 
first stage of our event detection system. The alarm has an 
associated symptom and location. It also has a start and end 
date, and associated tweet counts for each date within this 
period. When certain criteria are met an alarm is deemed to 
be an event. 

We employ the C2 and C3 variants of EARS. These
algorithms operate on a time series of count data, which in our
case is a count of daily symptomatic tweet activity. The C2
algorithm uses a sliding seven day baseline, and signals an
alarm for a time t when the difference between the actual
count at t and the moving average at t exceeds 3 standard
deviations. The C3 algorithm is based on C2, and in effect
triggers when there have been multiple C2 alarms over the
previous 3 days.
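The C2 rule described above can be sketched as follows. This is an illustrative reading, not the CDC reference implementation: the two-day gap between the baseline window and the current day follows the usual description of EARS C2 and is an assumption relative to the text here, the sample standard deviation is assumed, and the function name `c2_alarms` is hypothetical. C3 is not sketched.

```python
from statistics import mean, stdev
from typing import List


def c2_alarms(
    counts: List[float],
    baseline: int = 7,       # length of the sliding baseline in days
    lag: int = 2,            # gap between baseline and current day (assumption)
    threshold: float = 3.0,  # number of standard deviations required
) -> List[int]:
    """Return the indices t at which a C2-style alarm is signalled."""
    alarms: List[int] = []
    for t in range(baseline + lag, len(counts)):
        window = counts[t - lag - baseline : t - lag]
        moving_average = mean(window)
        sd = stdev(window)
        # With a perfectly flat baseline (sd == 0) any rise triggers an alarm.
        if counts[t] - moving_average > threshold * sd:
            alarms.append(t)
    return alarms


# Toy series: a stable baseline followed by a single sharp spike on day 9.
daily_counts = [5, 6, 5, 4, 5, 6, 5, 5, 5, 30, 5]
spike_days = c2_alarms(daily_counts)
```

In the toy series only the spike on day 9 exceeds the baseline mean by more than three standard deviations.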

These C2 and C3 candidate alarms are then grouped
together so that alarms for the same keyword set and area on
consecutive days are treated as a single alarm. An alarm is
therefore made up of one or more days, each with an
observed count of tweets.
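The grouping of consecutive alarm days can be illustrated with a short sketch. It assumes the alarm days for a single keyword group and area have already been collected into a list; the function name `group_alarms` is hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple


def group_alarms(days: List[date]) -> List[Tuple[date, date]]:
    """Merge alarm days into (start, end) runs of consecutive days."""
    runs: List[Tuple[date, date]] = []
    for day in sorted(set(days)):
        if runs and day - runs[-1][1] == timedelta(days=1):
            # Extend the current run by one day.
            runs[-1] = (runs[-1][0], day)
        else:
            # Start a new run.
            runs.append((day, day))
    return runs


# Toy example: two consecutive alarm days followed by an isolated one.
alarm_days = [date(2014, 8, 12), date(2014, 8, 11), date(2014, 8, 14)]
grouped = group_alarms(alarm_days)
```

Here the first two days merge into a single alarm spanning 11 to 12 August, and 14 August forms its own one-day alarm.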

Some of our Twitter count time series data is zero-skewed
and non-normal, since the number of geo-tagged users
reporting illness can be low. The number of standard
deviations from the mean used in the C2 and C3 algorithms can be
an unreliable measure of deviation in those
circumstances. Hence to determine how far above general baseline
activity an observed count is we employ the median of the
series to date and the Median Absolute Deviation (MAD) to
produce a new metric of alarm severity. The number of
Median Absolute Deviations from the median, $\mu$, gives a
comparable figure across alarms as to how sharp a rise has been
over expected levels. This figure is produced from the
following equation:

$$\mu = \frac{\text{observation} - \text{median}}{\text{MAD}} \qquad (1)$$

We then find the highest metric for an alarm, $\mu_{\max}$, by
finding the highest value of $\mu$ within the observations
making up the alarm.

$$\mu_{\max} = \max_{\text{observations in alarm}} \mu \qquad (2)$$
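The severity metric can be sketched in a few lines. Two details are assumptions, since the text does not settle them: the "series to date" is taken to be the counts strictly before the observed day, and a MAD of zero is mapped to infinity for observations above the median and to zero otherwise. The function names are hypothetical.

```python
from statistics import median
from typing import List, Sequence


def mad(series: Sequence[float]) -> float:
    """Median Absolute Deviation of a series."""
    m = median(series)
    return median(abs(x - m) for x in series)


def severity(observation: float, history: Sequence[float]) -> float:
    """Number of MADs by which an observation exceeds the median of its history."""
    m = median(history)
    d = mad(history)
    if d == 0:
        # Assumed convention for a degenerate (flat) history.
        return float("inf") if observation > m else 0.0
    return (observation - m) / d


def severity_max(counts: List[float], alarm_days: List[int]) -> float:
    """Highest severity over the days (indices into counts) that make up an alarm."""
    return max(severity(counts[t], counts[:t]) for t in alarm_days)


# Toy example: an alarm covering days 5 and 6 of a short series.
series = [1, 2, 3, 4, 5, 8, 6]
worst = severity_max(series, [5, 6])
```

For day 5 the history has median 3 and MAD 1, giving a severity of 5; day 6 scores lower, so the alarm's maximum is 5.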

The $\mu_{\max}$ value is the primary statistic which we use to
determine which events are real and which have just been
generated by random noise. Details of the threshold value which
we use for this and how we selected it are contained in
Section 4.

Another statistic which we employ in order to filter out
noise is the tweet-user ratio. This is the ratio of the number of
tweets in an event to the number of distinct users involved in
the event. A high value of this statistic would imply that some users have
tweeted a large number of times across a short time period,
which is an indication that they may be spammers and that
the alarm is spurious.
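The tweet-user ratio is straightforward to compute from the user IDs attached to an alarm's tweets. The sketch below assumes one user ID per tweet; the function names are hypothetical, and the default cut-off of 1.5 is the value reported in the Results section.

```python
from typing import List


def tweet_user_ratio(user_ids: List[str]) -> float:
    """Number of tweets divided by the number of distinct users who sent them."""
    if not user_ids:
        raise ValueError("an alarm must contain at least one tweet")
    return len(user_ids) / len(set(user_ids))


def looks_like_spam(user_ids: List[str], threshold: float = 1.5) -> bool:
    """Flag an alarm whose tweets are dominated by repeat posters."""
    return tweet_user_ratio(user_ids) > threshold


# Toy example: one account posts three of the four tweets in the alarm.
ratio = tweet_user_ratio(["a", "a", "a", "b"])
```

A ratio of 1.0 means every tweet came from a different user, while the toy alarm above has a ratio of 2.0 and would be flagged.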


















In summary, we use the output from EARS to produce
alarms. We filter the alarms to a set of high likelihood events
by using the $\mu_{\max}$ and tweet-user ratio parameters.

3.9 Situational Awareness 

Once an event has been identified our next objective is to
automatically provide additional context for it, which may
provide an explanation of the underlying cause. A human
interpreter could achieve this by reading all of the tweets
and synthesising them into a textual explanation, which
might be some text such as “People reacting to the death of
Robin Williams”. We approximate this automatically in two main
ways: by providing the most representative tweets from those that triggered
the alarm, and by linking to relevant news articles. The steps
involved in the Terms, News and Tweets (TNT) Event
Summarisation process are detailed in Algorithm 1. The steps
and terminology are then explained in more detail.


Algorithm 1 Terms, News and Tweets (TNT) Event Summarisation

 1: Fetch gist tweets and baseline tweets
 2: if |gist tweets| < 30 then
 3:     Do not attempt to summarise event
 4: else
 5:     Extract unigrams and bigrams appearing in at least 5% of the gist tweets
 6:     for all ngrams extracted do
 7:         Perform Fisher’s Exact Test to determine whether ngram is
            significantly more likely to appear in gist than baseline
 8:     for top 2 most significant unigrams and bigrams and the primary keyword do
 9:         Search news database using ngram for the alarm’s date range
            and return the top 10 documents
10:         Compute PCSS for documents returned
11:     for ngrams with PCSS values above threshold do
12:         Compute title similarity PCSS between ngram documents and
            those for each other ngram
13:     Good search terms ← terms with title similarity PCSS above threshold
14:     Good articles ← documents returned from good search terms
15:     Filtered tweets ← tweets containing a good search term
16:     Rank good articles by cosine similarity to average vector of good news articles
17:     Rank filtered tweets by cosine similarity to average vector of filtered tweets


(1) The first step is to retrieve the relevant tweets from
the processed tweet and alarm databases. Tweets are fetched
for both the alarm gist and from a historical baseline. (2)
We discard those events with fewer than 30 tweets as we
found that they did not contain sufficient data to produce
good summarisation results.

Definition 3. (Gist) The gist consists of the tweets for the
time period of the event which match the event’s keyword
group and area.

Definition 4. (Baseline) The baseline consists of the tweets 
for the same keyword group and area as an event from the 
28 days prior to that event. 

The next task is to find unigrams and bigrams that
are more prevalent in the gist than in the baseline. These
are likely to come from tweets discussing the event and will
thus be characteristic of the event. We first extract the most
common unigrams and bigrams from both sets of tweets,
after removal of stopwords. Our list of stopwords includes
a standard list, plus the 200 most frequent words from our
tweet database. We select all unigrams and bigrams that appear in
at least 5% of the tweets.

(7) We then perform a Fisher’s Exact Test to determine which of
the common unigrams and bigrams in the gist appear
significantly more frequently (α < 0.05) here than in the baseline
set. Our candidate terms are the top two most significant
unigrams and bigrams. We select the top two as this was found
to give the best results on our test examples. To this set we
append the primary keyword that triggered the alarm.
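The significance test for a single n-gram can be sketched with the standard library alone. The sketch assumes a one-sided test (the n-gram is over-represented in the gist), which matches the wording "significantly more frequently" but is not stated explicitly in the text. The function name and the argument layout of the 2x2 table are hypothetical.

```python
from math import comb


def fisher_exact_greater(
    gist_with: int, gist_without: int, base_with: int, base_without: int
) -> float:
    """One-sided Fisher's Exact Test p-value for a 2x2 contingency table.

    Returns the probability, under the hypergeometric null, of seeing at
    least `gist_with` gist tweets containing the n-gram given the margins.
    """
    gist_total = gist_with + gist_without
    base_total = base_with + base_without
    with_total = gist_with + base_with
    denominator = comb(gist_total + base_total, with_total)
    upper = min(gist_total, with_total)
    numerator = sum(
        comb(gist_total, k) * comb(base_total, with_total - k)
        for k in range(gist_with, upper + 1)
    )
    return numerator / denominator


# Toy example: the n-gram appears in 12 of 40 gist tweets
# but in only 3 of 200 baseline tweets.
p_value = fisher_exact_greater(12, 28, 3, 197)
is_candidate = p_value < 0.05
```

N-grams passing the test would then be sorted by p-value and the two most significant unigrams and bigrams kept as candidate terms.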

(9) Using these candidate terms we then perform a search
on Google for documents published in the United Kingdom
during the time period of the alarm. Due to Google’s Terms
of Service this step was performed manually. A fully
automated system would replace this step with a search of a news
database, which could be created by pulling down news
articles from RSS feeds of major content providers.

We take the first 10 documents retrieved for each
search term, remove stopwords and apply stemming using
a Lancaster stemmer. We then convert each document into
a Term Frequency/Inverse Document Frequency (TF/IDF)
vector. In order to determine whether the search term has
retrieved a coherent set of related documents we define a
metric based on cosine similarity, the Pairwise Cosine
Similarity Score (PCSS).

• The Pairwise Cosine Similarity Score of a group of
TF/IDF vectors is calculated by taking the cosine
similarity between each pair of vectors and adding them to a
set. The standard deviation of this set is subtracted from
its mean to form a score.
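The PCSS definition above translates directly into code. In this sketch TF/IDF vectors are represented as sparse dictionaries mapping terms to weights, and the population standard deviation is used; the text does not say whether the sample or population form was intended, so that choice is an assumption, as are the function names.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev
from typing import Dict, List

Vector = Dict[str, float]  # sparse TF/IDF vector: term -> weight


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pcss(vectors: List[Vector]) -> float:
    """Mean of all pairwise cosine similarities minus their standard deviation."""
    similarities = [cosine(u, v) for u, v in combinations(vectors, 2)]
    if not similarities:
        raise ValueError("PCSS needs at least two vectors")
    return mean(similarities) - pstdev(similarities)


# Toy example: a fully coherent set versus a set with one unrelated document.
coherent = pcss([{"ebola": 1.0}, {"ebola": 1.0}, {"ebola": 1.0}])
mixed = pcss([{"ebola": 1.0}, {"ebola": 1.0}, {"football": 1.0}])
```

The coherent set scores 1.0, while the mixed set is penalised both by its lower mean similarity and by the spread of its similarities.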

The PCSS rewards articles which are similar and
penalises any variance across those article similarities; this
reduces the effect of some articles being strongly related in
the document set and others being highly unrelated. Any
term which retrieves a set of documents with a score
below a threshold value (determined by a parameter selection
process detailed in Section 4) is not considered further.

It is possible for a search term to hit on a coherent set of
documents purely by chance, perhaps by finding news
articles related to another event in a different part of the world.
In order to guard against this we institute another check to
ensure that the set of documents returned from a search term
is sufficiently closely related to the set returned from at least
one other search term.

(12) In order to perform this check we compare the titles
of the articles returned from the two different searches using
a similar process to our earlier document comparison. We
found it more effective to compare titles than whole
documents, since sets of documents with similar topics can
contain similar language even for fairly unrelated search terms.
For example the terms “ebola” and “flu” will both return
health-related documents containing similar language, but
we would not wish to say that these search terms are related.
To convert the titles to TF/IDF vectors we remove stopwords
but do not apply stemming. Since the titles are so short we
include all unigrams, bigrams and trigrams in the vector
representation. We then compute a PCSS between the two
document sets, pairing each document in the first set with each
in the second and vice versa. A search term must be
related to at least one other term for it to be used going
forward.

(14) Once TNT has identified good search terms we then
return the news articles fetched using those terms. (16) In
order to rank the top news articles for a search we take the
average TF/IDF vector and then rank the articles by cosine
similarity to this average vector. We return the top ranked
articles from each search term.

In order to return the most explanatory tweets we find
the gist tweets that contain at least one of the good search
terms. We then convert these into TF/IDF vectors and
compute the average vector. The tweets are then ranked in the
same way, by cosine similarity to the average vector, and we
return the top 5 tweets.
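The ranking step, shared by articles and tweets, can be sketched as follows. Vectors are again sparse dictionaries of term weights; the function names and the toy weights are hypothetical, and real inputs would be TF/IDF vectors rather than the raw term indicators used here.

```python
from math import sqrt
from typing import Dict, List

Vector = Dict[str, float]  # sparse vector: term -> weight


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of sparse vectors."""
    totals: Vector = {}
    for vector in vectors:
        for term, weight in vector.items():
            totals[term] = totals.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in totals.items()}


def rank_by_average(items: List[str], vectors: List[Vector], top_k: int = 5) -> List[str]:
    """Return up to top_k items ordered by similarity to the average vector."""
    centre = average_vector(vectors)
    scored = sorted(
        zip(items, vectors), key=lambda pair: cosine(pair[1], centre), reverse=True
    )
    return [item for item, _ in scored[:top_k]]


# Toy example with three filtered tweets.
tweets = ["t1", "t2", "t3"]
tweet_vectors = [
    {"robin": 1.0, "williams": 1.0},
    {"robin": 1.0, "williams": 1.0, "rip": 1.0},
    {"lunch": 1.0},
]
top_two = rank_by_average(tweets, tweet_vectors, top_k=2)
```

The tweet sharing the most vocabulary with the rest of the set ranks first, and the off-topic tweet drops out of the top two.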


4 Results 

There are three individual components to our event detection
and situational awareness platform that require evaluation:

1. Event detection 

2. Situational awareness 

(a) Linkage of relevant news articles 

(b) Ranking most informative tweets 

4.1 Example Cases 

To effectively evaluate all of these components required a
varied set of example events and alarms. These were used
in order to choose values for our threshold parameters. We
compiled an initial set of 13 focus examples. These were
taken from events that the authors knew had happened in the
evaluation time period and from those alarms in our dataset
with low and high values of $\mu_{\max}$. The event ID which will
be used to refer to these events is composed of the first two
letters of the event keyword followed by a 1-2 letter area
code. The final part of the ID is the day and month of the
event start date.

The focus examples were used to find sensible values
that separated the high-confidence events from the
low-confidence events. The most important threshold
parameter in the context of the event detection is the $\mu_{\max}$ figure
which measures the deviation of the alarm counts from the
median level. Examining the distribution of the number of
alarms for each value of $\mu_{\max}$ revealed that it started to tail
off sharply at $\mu_{\max} > 5$. We therefore took this as a value to
segment additional test examples, drawing ten more at
random with a $\mu_{\max}$ less than 5 and ten with a $\mu_{\max}$ greater
than or equal to 5.


4.2 Event Detection Evaluation 


Method It is difficult to provide a completely automated
evaluation procedure for detecting previously unknown
events. Diaz-Aviles et al. used the time to detection on a known
outbreak as their evaluation criterion (Diaz-Aviles et al. 2012).
In our case we do not know a priori that these are genuine
outbreaks or events. Hence we need to make an assessment
of the alarms produced to see what they refer to and if there
is a way of externally verifying that they are genuine events.
For all 33 of the selected alarms the authors read the tweets
and determined whether they described a real world event.
The coders found 26 YES answers, 5 NO answers and 2
DISAGREED answers, producing a 94% agreement. Where
an event was present they wrote a short summary.
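The reported agreement follows directly from the coder counts, treating the YES and NO answers as the cases in which the coders agreed:

```latex
\[
\text{agreement} = \frac{26 + 5}{33} = \frac{31}{33} \approx 0.94
\]
```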

For external verification of events two different methods
were used, depending on whether the event was
symptom-related or emotion-based. For symptom related events the
activity spike was checked against official sources for the
same time period. The General Practitioner (GP) in hours
bulletin for England and Wales (Public Health England
2014) was used and an event was deemed verified if the
symptom exhibited an increasing trend for that period. This
detail is noted in the summary document produced by
Public Health England for that reporting period. Emotion-based
events were verified by checking if there were any articles
(via Web search) that could corroborate the cause of the
event (as given by the summary).

We manually investigated all examples from the initial
focus set and found initial parameters for the score functions in
our algorithms that worked reasonably well. These provided
possible ranges of values which were evaluated more
systematically over the entire alarm set. For event detection we
evaluated which alarms were flagged as events by the system
for each parameter value against whether those events were
externally verifiable. The final evaluation for all algorithms
contains all 33 of the alarms in both sets, not just the twenty
expanded “test” examples.


Results To determine if an alarm is an event that we
should be concerned about we consider two properties of the
alarm. The first is the tweet-user ratio. This provides a naive
spam filter, as when this is high an alarm is mostly caused by
one user tweeting multiple times. From exploratory testing
we found a value of 1.5 separated our spam and genuine
alarms very well, leaving only a small number of alarms
with large tweet sets and some spam. The spam detection
problem should be straightforward and will be addressed
more completely in future work.

The second figure, which gives the strength of the
activity above the usual baseline, is the $\mu_{\max}$ figure. This is the
essence of the modified EARS algorithm and the value of
this figure should generally separate events from non-events.

ID          Event  μmax  Keyword      Node
SAL-11-08   YES    20    Sadness      London
HFM-01-06   YES    19    Hayfever     Manchester
SAL-07-04   YES    14    Sadness      London
FEL-18-07   YES    13    Fear         London
ASL-02-04   YES    12    Asthma       London
FLP-06-10   YES    11    Flu          Portsmouth
HAM-02-04   YES    9     Happy        Manchester
HAM-18-04   YES    9     Happy        Manchester
SAL-08-07   YES    8     Sadness      London
HALE-01-08  YES    8     Happy        Leeds
HFE-14-05   YES    7     Hayfever     Leeds
SUN-29-08   YES    7     Surprise     Newcastle
ITL-08-06   YES    6     Itch         London
SAB-09-06   YES    6     Sadness      Birmingham
HABE-01-03  YES    5     Happy        Bridgend
SAL-21-03   YES    5     Sadness      London
HFC-09-04   YES    5     Hayfever     Cardiff
HFB-10-04   YES    5     Hayfever     Birmingham
VOL-20-04   YES    5     Vomit        London
SAC-05-05   YES    5     Sadness      Cardiff
HFL-04-07   NO     5     Hayfever     London
FLB-23-09   NO*    5     Flu          Birmingham
VPBR-10-05  YES    4     VeryPos      Bristol
FRL-30-05   YES    4     Fever        London
FLM-19-09   YES    4     Flu          Manchester
VOL-22-02   NO     3     Vomit        London
HFB-29-04   NO     3     Hayfever     Birmingham
JONO-23-02  YES    2     Joy          Norwich
HEM-06-03   NO     2     Headache     Manchester
SUC-23-05   NO     2     Surprise     Cardiff
SUL-16-08   NO     1     Surprise     London
FEBR-17-04  NO     0     Fear         Bristol
STL-26-08   NO     0     Sore Throat  London

Table 2: Evaluation set of events, showing whether they were
externally verifiable and their μmax value. *Note: this event was
not confirmed by the GP in hours report of that week. However, the
following week showed an increase and it is possible that social
media detected increased influenza activity before this was
confirmed by GP visits.

The criterion for selecting the best threshold for μmax is
context dependent. We have used the balanced F1 measure
for this scenario as that is a fair representation of both
precision and recall. The classification success and error types
are:

• True positive: instances at or above the threshold that are
verified events

• False positive: instances at or above the threshold that are
not verified events

• True negative: instances below the threshold that are not
verified events

• False negative: instances below the threshold that are
verified events

The precision, recall and F1 values for all the tested values
of μmax are displayed in Figure 3. The maximum F1 value,
0.9362, is observed at μmax ≥ 4, so this is a well balanced
threshold and the recommended parameter. Those seeking
higher confidence events (and willing to accept that some
events may be missed) could use a value of 6 for this
parameter, which yields a precision of 1. The maximum observed
recall is at the minimum parameter value and is not very
informative: that threshold flags every alarm as an event and
hence produces no false negatives.
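
The parameter selection procedure can be sketched as a simple sweep
over candidate thresholds. The code below is illustrative only: the
function names are hypothetical, and the alarm list is made-up toy data
rather than the evaluation set in Table 2, so the scores it prints are
not those reported above.

```python
def confusion(alarms, threshold):
    """Count (TP, FP, TN, FN) for alarms given as (mu_max, verified) pairs."""
    tp = fp = tn = fn = 0
    for mu_max, verified in alarms:
        flagged = mu_max >= threshold  # "at or above the threshold"
        if flagged and verified:
            tp += 1
        elif flagged:
            fp += 1
        elif verified:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(alarms, threshold):
    tp, fp, _, fn = confusion(alarms, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Hypothetical alarms, not the evaluation set from Table 2.
toy_alarms = [(9, True), (7, True), (6, True), (5, True), (4, True),
              (4, False), (3, False), (2, True), (1, False), (0, False)]

# Pick the threshold with the highest F1 (first one wins on ties).
best = max(range(0, 10), key=lambda t: precision_recall_f1(toy_alarms, t)[2])
print(best, precision_recall_f1(toy_alarms, best))
```

Sweeping the same function with a different selection key (for example
precision alone) gives the stricter thresholds that a user seeking
higher confidence events might prefer.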

In summary, the event detection mechanism based on the
EARS C2 and C3 algorithms with the addition of the μmax
and tweet-user ratio was found to perform well at detecting
events that could be externally verified as genuine. The
recommended μmax parameter (4) produced a good balance of
precision and recall in our sample set. It must be noted,
however, that we cannot gain a true picture of the overall recall
of the system, since we have no way of analysing the number
of genuine events that were not picked up.

4.3 Situational Awareness Evaluation 

Both situational awareness components were evaluated.
Firstly the news linkage was tested to see whether relevant
news was retrieved for the sample events. As part of this
analysis we compared our method of extracting informative
search terms (the TNT algorithm) with a comparable
automated technique. Secondly the tweet ranking was validated
to determine whether highly ranked tweets effectively
summarised the events.

Figure 3: μmax event detection parameter selection

Comparative News Linkage Evaluation. The news linkage
component works by selecting good search terms for articles
based on the TNT algorithm. Within this there is a term
extraction step to generate search terms, followed by a filtering
step using PCSS to remove terms which retrieve unrelated
sets of articles. We iterate over different threshold values for
the PCSS score to find the optimum, using an F0.5 measure
as the evaluation criterion. F0.5 was selected because
precision was judged to be more important than recall in this
setting. As a further evaluation we compare the results of
replacing our term extraction algorithm with Latent Dirichlet
Allocation (LDA). LDA is a popular topic modelling technique
that extracts sets of terms characterising each topic in
a group of documents. The success and error types used to
compute the F0.5 measure are:

• True positive: relevant news returned for newsworthy
event

• False positive: news returned for an event with no
genuine news

• True negative: no news returned for an event with no
genuine news

• False negative: no news returned for newsworthy event
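
To make the precision weighting of F0.5 concrete, the sketch below
computes the general F-beta score from the four outcome counts defined
above. The function name and the counts are hypothetical and chosen
only for illustration; they are not taken from our evaluation.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score from raw counts; beta < 1 weights precision more heavily."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: the same number of errors, but of different kinds.
missed_news = f_beta(tp=8, fp=0, fn=2)    # two newsworthy events with no news returned
spurious_news = f_beta(tp=8, fp=2, fn=0)  # two events given news they should not have
print(round(missed_news, 3), round(spurious_news, 3))
```

With beta set to 0.5, the run with two false positives scores lower
than the run with two false negatives, which is the behaviour wanted
when returning unrelated news is judged worse than returning none.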

The evaluation is presented in Figure 4, together with the
different levels of article PCSS that were iterated over in a
step-wise procedure to find the maximum F0.5 value. It is clear
from these plots that the TNT algorithm has a higher F0.5
at all tested values of the article PCSS, due to its higher
recall. The parameter selection process showed that a PCSS
threshold of -0.08 produced the best results.
Using this value the F0.5 was 0.79, showing that our system
was successful in retrieving relevant news for the sample
events.



(a) News linkage accuracy from Terms, News, Tweets (TNT) terms

(b) News linkage accuracy from Latent Dirichlet Allocation (LDA)
terms

Figure 4: Comparison of TNT and LDA event term extraction
methods for linking social and news media


Match    Count
Full     21
Partial  2
No       3

Table 3: STT tweet ranking evaluation. The STT tweet
summary fully matched the human-coded event summarisation
in 21 of 26 cases. This yields a full match fraction of 0.81.


Selecting top ranked relevant news articles is one part of
our situational awareness contribution. The second is the
selection of tweets that provide a representative summary of
an event.

Top Ranked Tweets Evaluation. We select the summary
tweets by choosing the 5 tweets in the candidate set with
the highest cosine similarity to an average tweet
TF/IDF vector. This candidate set can be: 1) all tweets in
the gist, 2) those filtered by selecting the extracted terms,
or 3) those from the filtered term set, that is, the extracted
term set less any terms that do not have a good news match.
Set 1) is always available and is labelled the
Gist Top Tweets (GTT). If terms have been found to be
significantly different in frequency from the baseline then set
2) is available for use, and if terms from that set have good
news matches then set 3) can be used. The Summary Top
Tweets (STT) are from set 3) if it exists and fall back to set 2)
if the good news match terms are not available. If no terms
were found to be significantly different from the baseline
then only the GTT is available.
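
The ranking and fallback described above can be sketched as follows.
This is a minimal illustration under stated assumptions, not our
implementation: tokenisation is plain whitespace splitting, the
smoothed IDF weighting is an assumed variant, and all function names
and example tweets are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tweets):
    """Return one sparse TF-IDF vector (dict of term -> weight) per tweet."""
    docs = [Counter(t.lower().split()) for t in tweets]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption; the exact weighting is not given in the text).
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_tweets(tweets, k=5):
    """Rank tweets by cosine similarity to the mean TF-IDF vector."""
    vectors = tfidf_vectors(tweets)
    mean = Counter()
    for vec in vectors:
        for term, w in vec.items():
            mean[term] += w / len(vectors)
    scored = sorted(zip(tweets, vectors),
                    key=lambda pair: cosine(pair[1], mean), reverse=True)
    return [tweet for tweet, _ in scored[:k]]


def summary_candidates(gist, term_filtered=None, news_filtered=None):
    """Fallback order described above: set 3, then set 2, then the whole gist."""
    return news_filtered or term_filtered or gist


# Hypothetical gist: three on-topic tweets and two unrelated ones.
example_gist = [
    "eaten too much chocolate feel sick",
    "too much chocolate today feel sick",
    "feel sick after all this easter chocolate",
    "my train is late again",
    "lovely weather in london",
]
print(top_tweets(summary_candidates(example_gist), k=3))
```

Because the mean vector is dominated by terms shared across many
tweets, tweets that use the common vocabulary of the event rank above
isolated off-topic tweets.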

We have employed two evaluations for the tweet ranking
exercise: comparison to human-coded event explanations,
and comparison between GTT and STT. The human-coded
event explanations were created by both authors after reading
through all of the tweets linked to each event. There were
26 alarms that had an identifiable cause. The tweet ranking
match (to human-coded event assessment) performance
is presented in Table 3. The tweets were considered a full
match if a human summary of the 5 top ranked tweets would
match the human-coded event explanation for the whole set
of tweets.

The partial matches were FRL-30-05 (Fever, London,
May) and FLP-06-10 (Flu, Portsmouth, October). These
events had more than one explanatory cause. Currently our
algorithms work best in the single event case. The three
cases that did not match were JONO-23-02 (Joy, Norwich,
February), STL-26-08 (Sore throat, London, August)
and SUN-29-08 (Surprise, Newcastle, August). The
coders disagreed as to whether STL-26-08 was actually
an event. The remaining two examples were not summarised
well by the significant tweets as they both exhibited high
disparity in the terms used to describe a contextually related
event, and SUN-29-08 also included a number of spam tweets
that distorted the results of TNT.

The second evaluation for the tweet ranking exercise was
a comparison between the GTT and the STT. A qualitative
assessment of the tweets led to the conclusion that STT
tweets were better in 11 out of 33 cases and that there was no
significant difference between the two in 21 cases out of 33. In
one case, FLP-06-10, the GTT included a mention of “flu
jab” (one of the manually selected terms) which the STT did
not include. Hence the STT provides an improvement over
ranking based on all alarm tweets in one third of instances.

4.4 Notable Examples Discussion 

We now discuss four example events that highlight the
strengths and limitations of our approach. These examples
are listed in Table 4.

The first example case is JONO-23-02. From a reading
of the tweets there were definitely some relating to a single
event: Norwich City Football Club beating Tottenham
Hotspur Football Club 1-0 in a football match. Both TNT
and LDA term extraction failed to find terms representative
of this event. This was due to the disparity of the language
used; the following example tweets should help elucidate
this point:

• #canarycall absolutely delighted with the win :) good
performance, good result

• #yellows almost didn’t go today glad i did 

• so glad i chose to come today!#ncfc 

It is difficult for a term-based solution to find any common
thread here. Finding the cause of this event would require
contextual knowledge of football matches, team names and
commonly employed aliases. The news linkage algorithm
did initially find a news story for the term “joy” on this date:
the British Prime Minister “let out a little cry of joy” over
David Bowie's comments on Scottish independence (Telegraph,
Feb 24, 2014). The articles returned all concerned this story
and were found to be closely related, but were dropped from
the news linkage because they did not match those returned
from the other search terms. This highlights the benefits of
searching with multiple terms and ensuring that the results
are related.

The second example is ASL-02-04. This event was due
to increased levels of air pollution observed in London at
the beginning of April, caused by a Saharan dust cloud. This
event had a μmax of 12, indicating a significant increase over
baseline activity for the alert period. It was well summarised
by all aspects of our situational awareness algorithm. The
top ranked tweets provided by our summary method (STT)
were more representative of the event than those
from all tweets in the gist. This is demonstrated by the top
tweet selected by each:

• STT top tweet: i can’t breathe #asthma #smog 

• GTT top tweet: my asthma is literally so bad

Here selecting the top tweets from the filtered event set
captures tweets representative of the event as opposed to the
baseline illness activity. The news linkage for this example
worked well, with all five of the top selected articles being
representative of the event. The top article, “Air pollution
reaches high levels in parts of England - BBC”, gives the
cause of the event in the first few lines: “People with health
problems have been warned to take particular care because
of the pollution - a mix of local emissions and dust from the
Sahara.”

The third case is VOL-20-04. Reading the tweets makes
it clear that this one-day event is caused by people feeling
sick after eating too much chocolate on Easter Sunday. In
this case the TNT summary and the ranking over all tweets
return similar tweets, as there is little baseline activity and
that baseline activity is not strongly related. The top tweets
from both sets therefore produce good summaries:


• STT top tweet: seriously ifeel sick having all this
chocolate

• GTT top tweet: eaten too much chocolate feel sick 

While the top ranked tweets are similar, the event tweet
filtering does remove baseline tweets referring to general
illness. No good news searches were found in this case. This
event may be valid in the context of social media but it is not
newsworthy.

The fourth example is SAL-11-08, which is the UK
Twitter reaction to the death of Robin Williams. These
tweets from the sadness keyword group exhibit both the
highest μmax (20) and the highest overall tweet count for
any single event (4472). The prominence of celebrity deaths
within our detected events mirrors earlier findings (Petrovic,
Osborne, and Lavrenko 2010). As with all of our high μmax
events, the TNT tweet ranking and news linkage work well.
The top news article returned reports the death
of Mr. Williams: “Robin Williams dies aged 63 in suspected
suicide” (Telegraph, August 12, 2014). The top five tweets
ranked after TNT tweet filtering are better than those ranked
on all tweets, as the filtering removes baseline general sadness
tweets from the ranking:


• STT top tweet: rip robin williams, sad day 

• GTT top tweet: yep , very sad 


5 Conclusion 

We have presented techniques for event detection and
situational awareness based on Twitter data. We have shown that
they are robust and generalisable to different event classes.
New event classes could be added to this system simply
by producing a list of keywords of interest and an optional
noise filter. Our event detection is based on the EARS
bio-surveillance algorithm with a novel filtering mechanism.
The maximum number of Median Absolute Deviations from the
median provides a robust statistic for determining the strength
of relative spikes in count-based time series. As it is based
on the median, this measure handles cases where data is
non-normal, as was the case for some of our symptom based
geo-tagged tweets. The event detection approach achieved an F1
score of 0.9362 on our event examples.
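
As an illustration of why a median-based spike statistic is robust, the
sketch below scores an alert window by the largest number of Median
Absolute Deviations above the baseline median. It is one plausible
reading of the statistic under stated assumptions, not the exact
definition used in our system: the function name, the baseline and
alert windows, the handling of a zero MAD and the example counts are
all hypothetical.

```python
from statistics import median


def max_mad_score(baseline, alert_window):
    """Largest number of MADs by which an alert-period count exceeds the baseline median."""
    centre = median(baseline)
    mad = median(abs(x - centre) for x in baseline)
    if mad == 0:
        # Flat baseline: fall back to 1 so the score stays finite (an assumption).
        mad = 1
    return max((x - centre) / mad for x in alert_window)


# Hypothetical daily counts; one past outlier barely moves the median or the MAD.
baseline_counts = [4, 5, 3, 6, 5, 4, 40]
alert_counts = [7, 19, 11]
print(max_mad_score(baseline_counts, alert_counts))
```

A mean and standard deviation computed over the same baseline would be
inflated by the single outlying day, which would shrink the score of
the spike; the median and MAD are largely unaffected by it.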

By filtering to terms that are significantly different (α <
0.05) in frequency from baseline levels we have extracted
terms to search news sources for related articles. Where
good news matches are found these revise our event term
list. We have created two novel algorithms that provide
additional situational awareness about an event from these event
terms.






ID          TNT Terms                                           LDA Terms
JONO-23-02  joy, enjoy                                          enjoy, glad, loss
ASL-02-04   asthma, air pollution, smog, pollution              asthma, smog, pollution, attack air
VOL-20-04   vomit, chocolate, easter                            chocolate, eaten, easter, vomit, headache
SAL-11-08   sadness, robin williams, sad news, robin, williams  sad, robin, williams, rip, riprobinwilliams

Table 4: Example cases and the terms extracted for them: top terms selected either by TNT or LDA.


Firstly, we rank the filtered set of news articles to produce
the top five representative articles. The news linkage,
weighted towards precision, achieved an F0.5 score of 0.79
on our example set, with no false positives.

Secondly, we produce a top five ranked list of tweets
that summarise an event. These ranked tweets are calculated
from the tweet set, filtered to those that contain the
extracted event terms. The top ranked tweets fully matched
our human-coded event summaries in 21 out of 26 cases.

In future work we aim to improve our news linkage algorithm
with a final step checking whether the articles returned
are similar to the event tweets, using cosine similarity or
other features such as entities identified in the news articles.
Additional improvements to event detection would lie in
improving spam detection and adding a sentiment classifier
to our emotion example. Collecting data over
longer time periods would also allow us to look into using
bio-surveillance algorithms which require seasonal baseline
information.